This is the R code used to import the data into R.
Seattle <- read_csv("2016-building-energy-benchmarking.csv")
view(Seattle)
The names of the columns are quite self-explanatory; however, we will further define them.
OSEBuildingID
: ID of building BuildingType
: Type and purpose of building PrimaryPropertyType
: Main use or function building PropertyName
: Name of the building on property Address
: Address where building is located City
: City where building is located State
: Statew here building is located ZipCode
: Zipe code of building CouncilDistrictCode
: Council district where building is located Neighborhood
: Colloquial name of city area where building is located Latitude
: Latitudinal coordinate of building Longitude
: Longitudinal coordinate of building YearBuilt
: Year in which the building was built NumberofBuildings
: Number of separate buildings which constitute the entire unit NumberofFloors
: Number of floors which a building contains PropertyGFATotal
: (Gross Floor Area) of Parking PropertyGFABuilding(s)
: (Gross Floor Area) of the building that is non-parking ListOfAllPropertyUseTypes
: Lists all the uses of of the building. This is a non-orthodox way of recording data as there are multiple data in one cell. For the purposes of this project, we do not need you wrangle and tidy this column.
Here, we will set Seattle
as a data from and as a tibble.
Seattle_t <- as_tibble(Seattle)
Seattle_df <- as.data.frame(Seattle)
Let’s identify and characterize the Seattle
table. Here, the head()
function will state which columns are doubles and which are characters. For these data, all the character columns are categorical variables and all the double columns are quantitative except for OSEBuildingID
. Even though OSEBuildingID
is a double, it is a categorical variable in reality since taking an average of the values or even regressing OSEBuildingID
to some other variable would be arbitrary and non-sensical.
#glimpse(Seattle)
head(Seattle)
## # A tibble: 6 × 19
## OSEBuildingID BuildingType PrimaryPropertyT… PropertyName Address City State
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 NonResidenti… Hotel Mayflower p… 405 Ol… Seat… WA
## 2 2 NonResidenti… Hotel Paramount H… 724 Pi… Seat… WA
## 3 3 NonResidenti… Hotel 5673-The We… 1900 5… Seat… WA
## 4 5 NonResidenti… Hotel HOTEL MAX 620 ST… Seat… WA
## 5 8 NonResidenti… Hotel WARWICK SEA… 401 LE… Seat… WA
## 6 9 Nonresidenti… Other West Precin… 810 Vi… Seat… WA
## # … with 12 more variables: ZipCode <dbl>, CouncilDistrictCode <dbl>,
## # Neighborhood <chr>, Latitude <dbl>, Longitude <dbl>, YearBuilt <dbl>,
## # NumberofBuildings <dbl>, NumberofFloors <dbl>, PropertyGFATotal <dbl>,
## # PropertyGFAParking <dbl>, PropertyGFABuilding(s) <dbl>,
## # ListOfAllPropertyUseTypes <chr>
We will now list the unique observations in the ListOfAllPropertyUseTypes
column.
Seattle %>%
count(ListOfAllPropertyUseTypes) %>%
filter(n>1)
## # A tibble: 153 × 2
## ListOfAllPropertyUseTypes n
## <chr> <int>
## 1 Automobile Dealership 2
## 2 Automobile Dealership, Parking 4
## 3 Bank Branch 2
## 4 Bank Branch, Office 3
## 5 Bank Branch, Office, Parking 2
## 6 Bank Branch, Parking 2
## 7 College/University 21
## 8 College/University, Parking 2
## 9 Data Center, Distribution Center, Office, Other 2
## 10 Data Center, Office 7
## # … with 143 more rows
For the sake of this project, we will filter all the Hotel values.
Seattle%>%
filter(ListOfAllPropertyUseTypes %in% c("Hotel",
"Hotel, Multifamily Housing, Parking, Retail Store",
"Hotel, Parking",
"Hotel, Parking, Restaurant",
"Hotel, Parking, Swimming Pool",
"Hotel, Restaurant",
"Hotel, Retail Store",
"Hotel, Swimming Pool"))-> Seattle_hotel
Seattle_hotel %>%
unite(Coordinates, Latitude, Longitude, sep = ",") -> Seattle_hotel
head(Seattle_hotel)
## # A tibble: 6 × 18
## OSEBuildingID BuildingType PrimaryProperty… PropertyName Address City State
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 NonResidenti… Hotel Mayflower pa… 405 Ol… Seat… WA
## 2 2 NonResidenti… Hotel Paramount Ho… 724 Pi… Seat… WA
## 3 3 NonResidenti… Hotel 5673-The Wes… 1900 5… Seat… WA
## 4 5 NonResidenti… Hotel HOTEL MAX 620 ST… Seat… WA
## 5 8 NonResidenti… Hotel WARWICK SEAT… 401 LE… Seat… WA
## 6 10 NonResidenti… Hotel Camlin 1619 9… Seat… WA
## # … with 11 more variables: ZipCode <dbl>, CouncilDistrictCode <dbl>,
## # Neighborhood <chr>, Coordinates <chr>, YearBuilt <dbl>,
## # NumberofBuildings <dbl>, NumberofFloors <dbl>, PropertyGFATotal <dbl>,
## # PropertyGFAParking <dbl>, PropertyGFABuilding(s) <dbl>,
## # ListOfAllPropertyUseTypes <chr>
Here we have explanatory data visualizations. We though that there could be some relationship between NumberofFloors
and YearBuilt
, our guess was that over time buildings got taller. This explanatory analysis strongly suggests that a linear regression cannot be used to study the relationship between NumberofFloors
and YearBuilt
.
ggplot(data = Seattle_hotel) +
geom_point(mapping = aes(x = YearBuilt, y = NumberofFloors))
We used a bar plot to visualize how many hotels are in each neighborhood.
ggplot(data = Seattle_hotel) +
geom_bar(mapping = aes(x = Neighborhood, fill = Neighborhood))
This is a scatter plot which plots PropertyGFATotal
against NumberofFloors
, and color codes points according to Neighborhood
.
ggplot(data = Seattle_hotel) +
geom_point(mapping = aes(x = PropertyGFATotal, y = NumberofFloors, color = Neighborhood)) #+
#ggtitle("Mammal Body Weight & Total Sleep Amount")
This is a regression which studies the relationship between YearBuilt
and PropertyGFABuilding(s)
.
ggplot(data = Seattle_hotel) +
geom_point(mapping = aes(x = YearBuilt, y = `PropertyGFABuilding(s)`))
cor(Seattle_hotel$YearBuilt,Seattle_hotel$`PropertyGFABuilding(s)`)
## [1] 0.1707993
lm_year_building <- lm(YearBuilt ~ `PropertyGFABuilding(s)`, data = Seattle_hotel)
tidy(lm_year_building)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1958. 5.63 348. 1.68e-116
## 2 `PropertyGFABuilding(s)` 0.0000415 0.0000284 1.46 1.49e- 1
This is a regression which studies the relationship between PropertyGFABuilding(s)
and PropertyGFATotal
.
ggplot(data = Seattle_hotel) +
geom_point(mapping = aes(x = `PropertyGFABuilding(s)`, y = PropertyGFATotal))
cor(Seattle_hotel$`PropertyGFABuilding(s)`,Seattle_hotel$PropertyGFATotal)
## [1] 0.9835021
#lm_Hotel_Building_GFA <- lm()
lm_GFA <- lm(`PropertyGFABuilding(s)` ~ PropertyGFATotal, data = Seattle_hotel)
tidy(lm_GFA)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2620. 4246. 0.617 5.39e- 1
## 2 PropertyGFATotal 0.835 0.0182 45.8 1.82e-54
This is a regression which studies the relationship between PropertyGFABuilding(s)
and PropertyGFAParking
.
ggplot(data = Seattle_hotel) +
geom_point(mapping = aes(x = `PropertyGFABuilding(s)`, y = PropertyGFAParking))
cor(Seattle_hotel$`PropertyGFABuilding(s)`,Seattle_hotel$PropertyGFAParking)
## [1] 0.5955192
lm_year_building <- lm(`PropertyGFABuilding(s)` ~ PropertyGFAParking, data = Seattle_hotel)
tidy(lm_year_building)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 79767. 16366. 4.87 0.00000645
## 2 PropertyGFAParking 2.25 0.360 6.25 0.0000000274
#PropertyGFABuilding(s)
This is a regression which studies the relationship between PropertyGFABuilding(s)
and PropertyGFATotal
.
ggplot(data = Seattle_hotel) +
geom_point(mapping = aes(x = `PropertyGFABuilding(s)`, y = PropertyGFATotal))
cor(Seattle_hotel$`PropertyGFABuilding(s)`,Seattle_hotel$PropertyGFATotal)
## [1] 0.9835021
lm_year_building <- lm(`PropertyGFABuilding(s)` ~ PropertyGFATotal, data = Seattle_hotel)
tidy(lm_year_building)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2620. 4246. 0.617 5.39e- 1
## 2 PropertyGFATotal 0.835 0.0182 45.8 1.82e-54